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Abstract 

We provide a principled way for investigators to analyze ran¬ 
domized experiments when the number of covariates is large. 
Investigators often use linear multivariate regression to analyze 
randomized experiments instead of simply reporting the difference 
of means between treatment and control groups. Their aim is to 
reduce the variance of the estimated treatment effect by adjusting 
for covariates. If there are a large number of covariates relative 
to the number of observations, regression may perform poorly 
because of overfitting. In such cases, the Lasso may be helpful. We 
study the resulting Lasso-based treatment effect estimator under 
the Neyman-Rubin model of randomized experiments. We present 
theoretical conditions that guarantee that the estimator is more 
efficient than the simple difference-of-means estimator, and we 
provide a conservative estimator of the asymptotic variance, which 
can yield tighter confidence intervals than the difference-of-means 
estimator. Simulation and data examples show that Lasso-based 
adjustment can be advantageous even when the number of 
covariates is less than the number of observations. Specifically, a 
variant using Lasso for selection and OLS for estimation performs 
particularly well, and it chooses a smoothing parameter based on 
combined performance of Lasso and OLS. 
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1. Introduction 

Randomized experiments are widely used to measure the 
efhcacy of treatments. Randomization ensures that treat¬ 
ment assignment is not influenced by any potential con¬ 
founding factors, both observed and unobserved. Exper¬ 
iments are particularly useful when there is no rigorous 
theory of a system’s dynamics, and full identification of 
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confounders would be impossible. This advantage was 
cast elegantly in mathematical terms in the early 20th 
century by Jerzy Neyman, who introduced a simple model 
for randomized experiments, which showed that the dif¬ 
ference of average outcomes in the treatment and control 
groups is statistically unbiased for the Average Treatment 
Effect (ATE) over the experimental sample [1]. 

However, no experiment occurs in a vacuum of scien¬ 
tific knowledge. Often, baseline covariate information 
is collected about individuals in an experiment. Even 
when treatment assignment is not related to these covari¬ 
ates, analyses of experimental outcomes often take them 
into account with the goal of improving the accuracy of 
treatment effect estimates. In modern randomized ex¬ 
periments, the number of covariates can be very large— 
sometimes even larger than the number of individuals in 
the study. In clinical trials overseen by regulatory bod¬ 
ies like the FDA and MHRA, demographic and genetic 
information may be recorded about each patient. In ap¬ 
plications in the tech industry, where randomization is 
often called A/B testing, there is often a huge amount of 
behavioral data collected on each user. However, in this 
‘big data’ setting, much of this data may be irrelevant to 
the outcome being studied or there may be more potential 
covariates than observations, especially once interactions 
are taken into account. In these cases, selection of impor¬ 
tant covariates or some form of regularization is necessary 
for effective regression adjustment. 

To ground our discussion, we examine a randomized 
trial of the Pulmonary Artery Catheter (PAC) that was 
carried out in 65 intensive care units in the UK between 
2001 and 2004, called PAC-man [2]. The PAC is a moni¬ 
toring device commonly inserted into critically ill patients 
after admission to intensive care, and it provides a contin¬ 
uous measurement of several indicators of cardiac activ¬ 
ity. However, insertion of PAC is an invasive procedure 
that carries some risk of complications (including death), 
and it involves significant expenditure both in equipment 
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costs and personnel [3] . Controversy over its use came to 
a head when an observational study found that PAG had 
an adverse effect on patient survival and led to increased 
cost of care [4] . This led to several large-scale randomized 
trials, including PAC-man. 

In the PAC-man trial, randomization of treatment was 
largely successful, and a number of covariates were mea¬ 
sured about each patient in the study. If covariate inter¬ 
actions are included, the number of covariates exceeds the 
number of individuals in the study; however, few of them 
are predictive of the patient’s outcome. As it turned out, 
the (pre-treatment) estimated probability of death was 
imbalanced between the treatment and control groups (p 
= 0.005, Wilcoxon rank sum test). Because the control 
group had, on average, a slightly higher risk of death, 
the unadjusted difference-in-means estimator may over¬ 
estimate the benefits of receiving a PAC. Adjustment for 
this imbalance seems advantageous in this case, since the 
pre-treatment probability of death is clearly predictive of 
health outcomes post-treatment. 

In this paper, we study regression-based adjustment, 
using the Lasso to select relevant covariates. Standard 
linear regression based on ordinary least squares suffers 
from over-fitting if a large number of covariates and in¬ 
teraction terms are included in the model. In such cases, 
researchers sometimes perform model selection based on 
observing which covariates are unbalanced given the real¬ 
ized randomization. This generally leads to misleading in¬ 
ferences because of incorrect test levels [5]. The Lasso [6] 
provides researchers with an alternative that can mitigate 
these problems and still perform model selection. We de¬ 
fine an estimator, ATALasso, which is based on running 
an /i-penalized linear regression of the outcome on treat¬ 
ment, covariates and, following the method introduced 
in [7], treatment x covariate interactions. Because of 
the geometry of the h penalty, the Lasso will usually set 
many regression coefficients to 0, and is well defined even 
if the number of covariates is larger than the number of 
observations. The Lasso’s theoretical properties under 
the standard linear model have been widely studied in 
the last decade; consistency properties for coefficient es¬ 
timation, model selection, and out-of-sample prediction 
are well understood (see [8] for an overview). 

In the theoretical analysis in this paper, instead of 
assuming that the standard linear model is the true 
data-generating mechanism, we work under the afore¬ 
mentioned non-parametric model of randomization intro¬ 
duced by Neyman [I] and popularized by Donald Ru¬ 
bin [9]. In this model, the outcomes and covariates are 
fixed quantities, and the treatment group is assumed to 
be sampled without replacement from a finite population. 
The treatment indicator, rather than an error term, is 
the source of randomness, and it determines which of two 
potential outcomes is revealed to the experimenter. Un¬ 


like the standard linear model, the Neyman-Rubin model 
makes few assumptions not guaranteed by the random¬ 
ization itself. The setup of the model does rely on the 
stable unit treatment value assumption (SUTVA), which 
states that there is only one version of treatment, and that 
the potential outcome of one unit should be unaffected 
by the particular assignment of treatments to the other 
units; however it makes no assumptions of linearity or ex¬ 
ogeneity of error terms. Ordinary Least Squares (OLS) 
[10] [11] [7], logistic regression [12], and post-stratification 
[13] are among the adjustment methods that have been 
studied under this model. 

To be useful to practitioners, the Lasso-based treat¬ 
ment effect estimator must be consistent and yield a 
method to construct valid confidence intervals. We out¬ 
line conditions on the covariates and potential outcomes 
that will guarantee these properties. We show that an up¬ 
per bound for the asymptotic variance can be estimated 
from the model residuals, yielding asymptotically con¬ 
servative confidence intervals for the average treatment 
effect which can be substantially narrower than the un¬ 
adjusted confidence intervals. Simulation studies are pro¬ 
vided to show the advantage of the Lasso adjusted esti¬ 
mator and to show situations where it breaks down. We 
apply the estimator to the PAC-man data, and compare 
the estimates and confidence intervals derived from the 
unadjusted, OLS-adjusted, and Lasso-adjusted methods. 
We also compare different methods of selecting the Lasso 
tuning parameter on this data. 

2. Framework and definitions 

We give a brief outline of the Neyman-Rubin model for 
a randomized experiment; the reader is urged to consult 
[1], [9], and [14] for more details. We follow the notation 
introduced in [10] and [7]. For concreteness, we illustrate 
the model in the context of the PAC-man trial. 

For each individual in the study, the model assumes 
that there exists a pair of quantities representing his/her 
health outcomes under the possibilities of receiving and 
not receiving the catheter. These are called the potential 
outcomes under treatment and control, and are denoted 
as Gi and bi, respectively. In the course of the study, 
the experimenter observes only one of these quantities 
for each individual, since the catheter is either inserted 
or not. The causal effect of the treatment on individual i 
is defined, in theory, to be ai — bi, but this is unobservable. 
Instead of trying to infer individual-level effects, we will 
assume that the intention is to estimate the average causal 
effect over the whole population, as outlined in the next 
section. 

In the mathematical specification of this model we con¬ 
sider the potential outcomes to be fixed, non-random 
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quantities, even though they are not all observable. The 
only randomness in the model comes from the assignment 
of treatment, which is controlled by the experimenter. We 
define random treatment indicators T^, which take on a 
value 1 for a treated individual, or 0 for an untreated indi¬ 
vidual. We will assume that the set of treated individuals 
is sampled without replacement from the full population, 
where the size of the treatment group is fixed beforehand; 
thus the Ti are identically distributed but not indepen¬ 
dent. The model for the observed outcome for individual 
i, defined as T;, is thus 

Yi = Tiai -f (1 — Ti)bi. 

This equation simply formalizes the idea that the exper¬ 
imenter observes the potential outcome under treatment 
for those who receive the treatment, and the potential 
outcome under control for those who do not. 

Note that the model does not incorporate any covari¬ 
ate information about the individuals in the study, such 
as physiological characteristics or health history. How¬ 
ever, we will assume we have measured a vector of base¬ 
line, pre-experimental covariates for each individual i. 
These might include, for example, age, gender, and ge¬ 
netic makeup. We denote the covariates for individual 
i as the column vector = (x^i, G and the 

full design matrix of the experiment as X = (xi,..., x„)^. 
In the theoretical results, we will assume that there is a 
correlational relationship between an individual’s poten¬ 
tial outcomes and covariates, but we will not assume a 
generative statistical model. 

Define the set of treated individuals as A = {i G 
{1,..., n} : Ti = 1}, and similarly define the set of control 
individuals as B. Define the number of treated and con¬ 
trol individuals as ua = |A| and ns = \B\, respectively, 
so that UA -I- nB = n. To indicate averages of quantities 
over these individuals, we introduce the notation ~a and 
~B- Thus, for example, the average value of the potential 
outcomes and the covariates in the treatment group are 

b,A = 

respectively. Note that these are random quantities in 
this model, since the set A is determined by the random 
treatment assignment. When we want to take the average 
over the whole population, we will use the notation b such 
as 


3. Treatment effect estimation 

Our main inferential goal will be average effect of the 
treatment over the whole population in the study. In 
a trial such as PAC-man, this represents the difference 
between the average outcome if everyone had received the 
catheter, and the average outcome if no one had received 
it. This is defined as 


ATE = a-b. 


The most natural estimator arises by replacing the pop¬ 
ulation averages with the sample averages: 

AT Aunadj = aA — bB, 


The subscript “unadj” indicates an estimator without re¬ 
gression adjustment. The foundational work in [1] points 
out that, under a randomized assignment of treatment, 
ATAunadj is Unbiased for ATE, and derives a conserva¬ 
tive procedure for estimating its variance. 

While ATAunadj is an attractive estimator, covariate 
information can be used to make adjustments in the hope 
of reducing variance. A commonly used estimator is 


ATAadj — 


a A - (XA - x)^ - bB- (xb - x)^ 


^ (a) (6) 

where /3 , /3 G are adjustment vectors for the 

treatment and control groups, respectively, as indicated 
by the superscripts. The terms 5c a — x and xb — x rep¬ 
resent the fluctuation of the covariates in the subsample 
relative to the full sample, and the adjustment vectors fit 
the linear relationships between the covariates and poten¬ 
tial outcomes under treatment and control. For example, 
in the PAC-man trial, this would help alleviate the imbal¬ 
ance in the pre-treatment estimated probability of death: 
the corresponding element of xb — x would be positive 
(due to the higher average probability of death in the 

'■{b) 

control group), the corresponding element of f3 would 
be negative (a higher probability of death correlates with 
worse health outcomes), so the overall treatment effect 
estimate would be adjusted downwards. This procedure 
is equivalent to imputing the unobserved potential out¬ 
comes; if we define 

OB = UA + (xb - xa)^^^“\ bA = bB + (5ca - 5c b)'^ 


we can form the equivalent estimator 


— vn 7 vn 7 — vn 

CL — TL ^ ^ ^ ^ • 


ATEg^di = n ^ (uAttA + nBttB) 


{uBbB 


■ nAbA 


Note that the averages of potential outcomes over the 
whole population are not considered random, but are un¬ 
observable. 
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If we consider these adjustment vectors to be fixed (non- 
random), or if they are derived from an independent data 
source, then this estimator is still unbiased, and may 






have substantially smaller asymptotic and finite-sample 
variance than the unadjusted estimator. This allows for 
construction of tighter confidence intervals for the true 
treatment effect. 

In practice, the “ideal” linear adjustment vectors, lead¬ 
ing to a minimum-variance estimator of the form of 
ATEad'i, cannot be computed from the observed data. 
However, they can be estimated, possibly at the expense 
of introducing modest finite-sample bias into the treat¬ 
ment effect estimate. In the classical setup, when the 
number of covariates is relatively small, ordinary least 
squares (OLS) regression can be used. The asymptotic 
properties of this kind of estimator are explored under 
the Neyman-Rubin model in [11], [12], and [7]. We will 
follow a particular scheme which is studied in [7] and 
shown to have favorable properties: we regress the out¬ 
come on treatment indicators, covariates, and treatment 
X covariate interactions. This is equivalent to running 
separate regressions in the treatment and control groups 
of outcome against an intercept and covariates. If we de- 

^ (a) (&) 

fine /3 ols /Sqls the coefficients from the separate 

regressions, then the estimator is 


A-TEols = [oa - (xa - x)^ ^OLS 


be - (xb -xj'^^oLS 


This has some finite-sample bias, but [7] shows that it 
vanishes quickly at the rate of 1 /n under moment condi¬ 
tions on the potential outcomes and covariates. Moreover, 
for a fixed p, under regularity conditions, the inclusion 
of interaction terms guarantees that it never has higher 
asymptotic variance than the unadjusted estimator, and 
asymptotically conservative confidence intervals for the 
true parameter can be constructed. 

In modern randomized trials, where a large number 
of covariates are recorded for each individual, p may be 
comparable to or even larger than n. In this case OLS 
regression can overfit the data badly, or may even be ill- 
posed, leading to estimators with large finite-sample vari¬ 
ance. To remedy this, we propose estimating the adjust¬ 
ment vectors using the Lasso [6] . The adjustment vectors 
would take the form 


a(“) 

/^Lasso = argmin 
P 


1 

2nA 


i^A 


ai - CA - (xi - xa)^/3^ 


-l-Aa ^ \l 3 j\ 


( 1 ) 


x(b) 

/3Lasso = argmin 
P 


1 

2nB 


(bi - 6s - (xi - xb)^/3^ 


iGB 


+ ^6 ^ j^jj 

1 = 1 


( 2 ) 


and the proposed Lasso adjusted ATE estimator is^ 


ATELasso = 


_,T a(“) 

OA - (XA - X) /3Lasso 


- bB - (XB - ^lLso 


Here Aa and Af, are regularization parameters for the 
Lasso which must be chosen by the experimenter; sim¬ 
ulations show that cross-validation works well. In the 
next section, we study this estimator under the Neyman- 
Rubin model, and provide conditions on the potential 
outcomes, the covariates and the regularization parame¬ 
ters under which ATEi^^sso enjoys similar asymptotic and 
finite-sample advantages as ATEols- 

It is worth noting that when two different adjustments 
are made for the treatment and control groups as in [7] 
and here, the covariates do not have to be the same for 
the two groups. However, when they are not the same, 
the Lasso or OLS adjusted estimators are no longer guar¬ 
anteed to have smaller or equal asymptotic variance than 
the unadjusted one, even in the case of fixed p. In prac¬ 
tice, one may still choose between the adjusted and unad¬ 
justed estimators based on the widths of the correspond¬ 
ing confidence intervals. 


4. Theoretical results 

4.1. Notation 

For a vector f3 G RP and a subset S C {1, ■■■,p}, let (3j be 
the j-th component of /3, f3g = {j3j : j G S)'^, 5'“ be the 
complement of S, and [S’] the cardinality of the set S. For 
any column vector u = (m,..., Um)'^, let llujj| = 
l|u||i = ThLi\uz\, Ijulloo = maxi=i,...^™ [mj] and l]ullo = 
[{j : Uj 0}\. For a given m x m matrix D, let Amin (LI) 
and Amax(LI) be the smallest and largest eigenvalues of 
D respectively, and D~^ the inverse of the matrix D. 
Let and denote convergence in distribution and in 
probability, respectively. 


4.2. Decomposition of the potential 
outcomes 


The Neyman-Rubin model does not assume a linear 
relationship between the potential outcomes and the co¬ 
variates. In order to study the properties of adjustment 
under this model, we decompose the potential outcomes 
into a term linear in the covariates and an error term. 

^To simplify the notation, we omit the dependence of /QLassoi 
(^Lasso) population size n. 


4 















Given vectors of coefficients £ R^’, we write^ for 

i = 


Oi = a -1- (Xi - x)^/3^“^ -1- e^fi\ 

(3) 

b,=b+(x,-xfj3<-'’^+e^2\ 

(4) 


Note that we have not added any assumptions to the 
model; we have simply defined nnit-level residuals, 
and ef\ given the vectors All the quantities 

in 3 and 4 are fixed, deterministic numbers. It is easy to 
verify that — 0. In order to pursue a theory 

for the Lasso, we will add assumptions on the populations 
of Oi’s, 6i’s, and x^’s, and we will assume the existence 
of such that the error terms satisfy certain as¬ 

sumptions. 


Definition 2 Define Sn to be the maximum covariance 
between the error terms and the covariates. 

5n = max < max 

uj = a,b I j 

The following conditions will guarantee that the Lasso 
consistently estimates the adjustment vectors 
at a fast enough rate to ensure asymptotic normality of 
AT i^Lasso ■ It is an open question whether a weaker form 
of consistency would be sufficient for our results to hold. 


{fit'’ - e^“’) 


• (9) 


Condition 4 Decay and scaling. 
max s*^**^}. 


Sn. = ^ 


3\/logp 


Let s 


( 10 ) 


4.3. Conditions 


(slogp)/v^= o(l). (11) 


We will need the following to hold for both the treatment 
and control potential outcomes. The first set of assump¬ 
tions (1-3) are similar to those found in [7]. 

Condition 1 Stability of treatment assignment probabil¬ 
ity. 

UA/n^PA, as n —>■ oo (5) 

for some pA £ (0,1). 

Condition 2 The centered moment conditions. There 
exists a fixed constant L > 0 such that, for all n = 1, 2,... 
and j = 1, ...,p, 

( 6 ) 

< L-, < L. (7) 

Conditions The means 127=1 > 

converge to 

finite limits. 

Since we consider the high-dimensional setting where p 
is allowed to be much larger than n, we need additional 
assumptions to ensure that the Lasso is consistent for 
estimating ,3^“^ and Before stating them, we define 
several quantities. 

Definition 1 Given and the sparsity measures 
for treatment and control groups, and are defined 
as the number of nonzero elements of (3^°’'^ and i.e., 

s(“) = 10-:/3f) ^ 0}|, = ^0}|, (8) 


Condition 5 Cone invertibility factor. Define the Gram 
matrix as E = n~^ 127=ii^'i ~ ^)(^i “ x)^; There exist 
constants C > 0 and ^ > 1 not depending on n, such that 

llhslli <Cs||Eh||oo, Vh£C, (12) 


with C = {h : ||hsc||i < ,f||hs||i}, and 

S={j-. f3f fi 0 or f3f^ fi 0}. (13) 

Conditions Let t = min {l/70, (3p^)^/70, (3 — 
3p^)^/70}. For constants 0 < r] < and ^ < M < oo, 
assume the regularization parameters of the Lasso belong 
to the sets 


V \ PA 


a,£(1m|x 

V \ PB 


+ dnj , 

(14) 

+ <5ra ] • 

(15) 


Denote respectively the population variances of 
and and the population covariance between them by 


CTe(-) = n ^ 






_i (a) (b) 

= n ^}2i=A ’e] 


Theorem 1 Assume conditions 1 through 6 hold for 
some (3^°'^ and f3^’’K Then 

y/fi (iTALasso - AT E) A M (O, (T^) (16) 


respectively. We will allow 3*-“^ and to grow with n, 
though the notation does not explicitly show this. 

^Again, we omit the dependence of A^, A(,, eG) and 

on n. 


where 


a = lim 

n—^oo 


1 - PA _2 , 


PA 


3- PA 




2a, 


e(a)e(6) 


(17) 
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The proof of Theorem 1 is given in the supporting in¬ 
formation. It is easy to show, as in the following corollary 
of Theorem 1, that the asymptotic variance of ATifLasso 
is no worse than ATifunadj when /3^“^ and are de¬ 
fined as coefficients of regressing potential outcomes on 
a subset of covariates. More specifically, suppose there 
exists a subset J C {1, ...,p}, such that 

/3(“) = ((/3(“))^,0)^, / 3 W = ((/3^'))^,0)^, (18) 

where /3j“^ and /3j^ are the population level OLS coef¬ 
ficients for regressing the potential outcomes a and b on 
the covariates in the subset J with intercept, respectively. 

Corollary 1 For and defined in 18 and some 
Xa and Xb, assume conditions 1 through 6 hold. Then the 
asymptotic variance of yjn ATi^Lasso is no greater than 
that of the y/n ATE^nadj- The difference is 
where 

A = - hm \\X(3e\\1 < 0, (19) 

n—>-oo 

/3l, = (1-Pa)/3^“^+Pa/3<'^ (20) 

Remark 1. If, instead of Condition 6, we assume that 
the covariates are uniformly bounded, i.e., max^j \xij\ < 
L, then the fourth moment condition on the error terms, 
given in 7, can be weakened to a second moment condi¬ 
tion. While we do not prove the necessity of any of our 
conditions, our simulation studies show that the distribu¬ 
tions of the unadjusted and the Lasso adjusted estimator 
may be non-normal when: (1) The covariates are gener¬ 
ated from Gaussian distributions and the error terms do 
not satisfy second moment condition, e.g., being generated 
from a t distribution with one degree of freedom; or (2) 
The covariates do not have bounded fourth moments, e.g., 
being generated from a t distribution with three degrees of 
freedom. See the histograms in Figure 1 where the cor¬ 
responding p-values of Kolmogorov-Smimov testing for 
normality are less than 2.2e— 16. These findings indicate 
that our moment conditions cannot be dramatically weak¬ 
ened for asymptotic normality. However, we also find 
that the Lasso adjusted estimator still has smaller vari¬ 
ance and mean squared error than the unadjusted estima¬ 
tor, even when these moment conditions do not hold. In 
practice, when the covariates do not have bounded fourth 
moments, one may perform some transformation — e.g., a 
logarithm transformation—to ensure that the transformed 
covariates have bounded fourth moments while having a 
sufficiently large variance so as to retain useful informa¬ 
tion. We leave it as future work to explore the properties 
of different transformations. 

Remark 2. Statement 11, typically required in de¬ 
biasing the Lasso [15], is stronger by a factor of ^/logp 
than the usual requirement for li consistency of the Lasso. 


Error term from t1 distribution 
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Figure 1: Histograms of the unadjusted estimator and 
the Lasso adjusted estimator when the moment condi¬ 
tions do not hold. We select the tuning parameters for 
Lasso using 10-fold cross validation. The potential out¬ 
comes are simulated from linear regression model and 
then kept fixed, see more details in the supporting infor¬ 
mation. For the upper two subplots, the error terms are 
generated from t distribution with one degree of freedom 
and therefore do not satisfy second moment condition; 
while for the lower two subplots, the covariates are gen¬ 
erated from t distribution with there degrees of freedom 
and thus violate fourth moment condtion. 


Remark 3. Condition 5 is slightly weaker than the 
typical restricted eigenvalue condition for analyzing the 
Lasso. 

Remark 4. If we assume dn = O ^which satis¬ 
fies 10, then Condition 6 requires that the tuning parame¬ 
ters are proportional to which is typically assumed 

for the Lasso in the high-dimensional linear regression 
model. 

Remark 5. For fixed p, = 0 in 9, Condition f holds 
automatically, and Condition 5 holds when the smallest 
eigenvalue of S is uniformly bounded away from 0. In this 
case. Corollary 1 reverts to Corollary 1.1. in [7]. When 
these conditions are not satisfied, we should set Xa and Xb 
to be large enough to cause the Lasso adjusted estimator 
to revert to the unadjusted one. 
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5. Neyman-type conservative 
variance estimate 


Therefore, for the and defined in [18], the limit 
of CTLasso greater than that of and the differ¬ 

ence is 


We note that the asymptotic variance in Theorem 1 in¬ 
volves the cross-product term crg(a) which is not consis¬ 
tently estimable in the Neyman-Rubin model as at and bi 
are never simultaneously observed. However, we can give 
a Neyman-type conservative estimate of the variance. Let 



( 22 ) 

where and are degrees of freedom defined by 


d/(“) _ -I- 1 — lli^Laisollo + 1; 


- lim -V — 

n-^oo n PA 

i = l 

Remark 7. With the conservative variance estimate in 
Theorem 2, the Lasso adjusted confidence interval is also 
valid for the PATE (Population Average Treatment Ef¬ 
fect) if there is a super population of size N with N > n. 
Remark 8. The extra Condition 7 is used to obtain the 
following bounds for the number of selected covariates by 
the Lasso: max (s(“\ = Op(min (n^, ns)). Condi¬ 

tion 7 can he removed from Theorem 2 if we redefine 
and without adjusting the degrees of freedom, i.e.. 



(x,-x)-(/3(“)) 


-f 


I-PA 


(x,-x)-(/3('’)) 


^^/(')=s(') + l = ||/9ilL||o + l. 

Define the variance estimate of •v/n(HTi?Lasso — ATE) 
as follows: 


'' Lasso 


UA 


g(a) 


ub 




(23) 


Condition 7 For the Cram matrix S defined in Condi¬ 
tion 5, the largest eigenvalue is bounded away from oo, 
that is, there exists a constant Amax < oo such that 


X 


max 


(Lj) ^ Ajjiax 


Theorem 2 Assume conditions in Theorem 1 and con¬ 
dition 7 hold. Then (TLasso converges in probability to 


— lim a] 

Pj ^ n^oo 


(a) 


- - lim 

1 — PA n-i-oo 


which is greater than or equal to the asymptotic variance 
of y/n{ATEi^asso — ATE). The difference is 


lim — 

n—>-oo Tl 


E 


-bi- ATE - (xi - x)^(/3 


(a) 



Remark 6. The Neyman-type conservative variance es¬ 
timate for the unadjusted estimator is given by 


.. 2 

^unadj 


n 1 
WA WA - 1 


E(“' 

i^A 


- \2 , 
CL A) + 


n 1 
riB riB — 


1 


Y^{bi-bB)\ 

ieB 


which, under second moment conditions of potential out¬ 
comes a and b, converges in probability to 


11 1 1 _ 

— lim — y^(ai — d)^-I- -- lim —'S^{bi — b)^. 

Va n^oo n ^' 1 — Da ri->oo n ^' 
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and define (d*)2^,,^^ = ^{a*)l^^^ + :^{a*)l^,y It follows 
from the bounds for max {s^‘^\ s^^^) that ((t^(a), 17^(6)) and 
((^*)^(.),(^*)^(.,) have the same asymptotic property. 


Theorem 3 Assume the conditions in Theorem 1 hold. 
Then converges in probability to 


(tIm + -j— 

Pa i — Pa 


— lim 
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Remark 9. Though (<T*)^asso has the same limit as 
(TLasso; owr simulation experience shows that, in finite 
samples, the confidence intervals based on (d*)Lasso 
yield low coverage probabilities (e.g., the coverage prob¬ 
ability for 95% confidence interval can be only 80%^. 
Hence, we recommend readers to use (JLasso practice. 


6. Related work 

The Lasso has already made several appearances in the 
literature on treatment effect estimation. In the con¬ 
text of observational studies, [15] constructs confidence 
intervals for preconceived effects or their contrasts by de¬ 
biasing the Lasso adjusted regression, [16] employs the 
Lasso as a formal method for selecting adjustment vari¬ 
ables via a two-stage procedure which concatenates fea¬ 
tures from models for treatment and outcome, and simi¬ 
larly, [17] gives very general results for estimating a wide 
range of treatment effect parameters, including the case 
of instrumental variables estimation. In addition to the 
Lasso, [18] considers nonparametric adjustments in the 
estimation of ATE. In works such as these, which deal 
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with observational studies, confounding is the major is¬ 
sue. With confounding, the naive difference-in-means es¬ 
timator is biased for the true treatment effect, and ad¬ 
justment is used to form an unbiased estimator. How¬ 
ever, in our work, which focuses on a randomized trial, 
the difference-in-means estimator is already unbiased; ad¬ 
justment reduces the variance while, in fact, introduc¬ 
ing a small amount of finite-sample bias. Another major 
difference between this prior work and ours is the sam¬ 
pling framework: we operate within the Neyman-Rubin 
model with fixed potential outcomes for a finite popula¬ 
tion, where the treatment group is sampled without re¬ 
placement, while these papers assume independent sam¬ 
pling from a probability distribution with random error 
terms. 

Our work is related to the estimation of heterogeneous 
or subgroup-specific treatment effects; including interac¬ 
tion terms to allow the imputed individual-level treat¬ 
ment effects to vary according to some linear combination 
of covariates. This is pursued in the high-dimensional set¬ 
ting in [19]; this work advocates solving the Lasso on a 
reduced set of modified covariates, rather than the full set 
of covariate x treatment interactions, and includes exten¬ 
sions to binary outcomes and survival data. The recent 
work in [20] considers the problem of designing multiple¬ 
testing procedures for detecting subgroup-specific treat¬ 
ment effects; they pose this as an optimization over test¬ 
ing procedures where constraints are added to enforce 
guarantees on type-I error rate and power to detect ef¬ 
fects. Again, the sampling framework in these works is 
distinct from ours; they do not use the Neyman-Rubin 
model as a basis for designing the methods or investigat¬ 
ing their properties. 

7. PAC data illustration and 
simulations 

We now return to the PAC-man study introduced earlier. 
We examine the data in more detail and explore the re¬ 
sults of several adjustment procedures. There were 1013 
patients in the PAC-man study: 506 treated (managed 
with PAC) and 507 control (managed without PAC, but 
retaining the option of using alternative devices). The 
outcome variable is quality-adjusted life years (QALYs). 
One QALY represents one year of life in full health; in- 
hospital death corresponds to a QALY of zero. We have 
59 covariates about each individual in the study; we in¬ 
clude all main effects as well as 1113 two-way interactions, 
and form a design matrix X with 1172 columns and 1013 
rows. See Appendix B for more details on the design 
matrix. 

The assumptions that underpin the theoretical guaran¬ 
tees of the ATAuasso estimator are, in practice, not explic¬ 


itly checkable, but we attempt to inspect the quantities 
that are involved in the conditions to help readers make 
their own judgement. The uniform bounds on the fourth 
moments refer to a hypothetical sequence of populations; 
these cannot be verified given that the investigator has a 
single dataset. However, as an approximation, the fourth 
moments of the data can be inspected to ensure that they 
are not too large. In this data set, the maximum fourth 
moment of the covariates is 37.3, which is indicative of a 
heavy-tailed and potentially destabilizing covariate; how¬ 
ever, it occurs in an interaction term not selected by the 
lasso, and thus does not influence the estimate^. Check¬ 
ing the conditions for high-dimensional consistency of the 
Lasso would require knowledge of the unknown active 
set S, and moreover, even if it were known, calculating 
the cone invertibility factor would involve an infeasible 
optimization. This is a general issue in the theory of 
sparse linear high-dimensional estimation. To approxi¬ 
mate these conditions, we use the bootstrap to estimate 
the active set of covariates S and the error terms 
and See the supporting information for more details. 
Our estimated S contains 16 covariates and the estimated 
second moments of and are 11.8 and 12.0, respec¬ 
tively. The estimated maximal covariance Sn equals 0.34 
and the scaling {s\ogp)/y/n is 3.55. While this is not 
close to zero, we should mention that the estimation of <5„ 
and {s log p)/y/n can be unstable and less accurate since 
it is based on a subsample of the population. As an ap¬ 
proximation to Condition 5, we examine the largest and 
smallest eigenvalues of the sub-Gram matrix (l/n)XgX 5 , 
which are 2.09 and 0.18 respectively. Thus the quantity 
in Condition 5 seems reasonably bounded away from zero. 

We now estimate the ATE using the unadjusted estima¬ 
tor, the Lasso adjusted estimator and the OLS adjusted 
estimator which is computed based on a sub-design ma¬ 
trix containing only the 59 main effects. We also present 
results for the two-step estimator ATALasso-i-OLS which 
adopts the Lasso to select covariates and then uses OLS 
to refit the regression coefficients. See [22-25] for statis¬ 
tical properties of Lasso-|-OLS estimator in linear regres- 
(®) 

sion model. Let /3 be the Lasso estimator defined in 1 
(we omit the subscript “Lasso” for the sake of simplicity) 

and let = {j : ,9^ ^ A 0} be the support of ,9^ \ The 
Lasso-I-OLS adjustment vector /3 l1Lo+ols treatment 


^The fourth moments of the covariates are shown in Fugure 13 in 
Appendix F. The covariates with the largest two fourth moments 
(37.3 and 34.9 respectively) are quadratic term interactnew^ 
and interaction term IMscorerct : systemnew. Neither of them 
are selected by the Lasso to do the adjustment. 



group A is defined by 


ATE estimates for PAC data 


^Lasso+OLS = argmin ^ ^ [ai - a a 
I3-. /3,-=0. *6A 

-(Xi - xa)^/3]^ ■ 

We can define the Lasso+OLS adjustment vector 
" (f>) 

/^Lasso+OLS Control group B similarly. Then 

ATALasso+OLS IS given by 

aA - (XA - x)'^^J1L+ols 

- Ib- (xb - xj^^L^^o+oLS • 

In the next paragraph and in Algorithm 1 of Appendix 
F, we show how we adapt the cross-validation proce¬ 
dure to select the tuning parameter for ATALasso-i-OLS 
based on a combined performance of Lasso and OLS, or 
cv(Lasso-l-OLS). 

We use the R package “glmnet” to compute the Lasso 
solution path and select the tuning parameters Aa and Af, 
by 10-fold Cross Validation (CV). To indicate the method 
of selecting tuning parameters, we denote the correspond¬ 
ing estimators as cv(Lasso) and cv(Lasso-l-OLS) respec¬ 
tively. We should mention that for the cv(Lasso-l-OLS) 
adjusted estimator, we compute the CV error for a given 
value of Aa (or A&) based on the whole Lasso-|-OLS proce¬ 
dure instead of just the Lasso estimator (see Algorithm 1 
in Appendix F). Therefore, the cv(Lasso-l-OLS) and the 
cv(Lasso) may select different covariates to do the adjust¬ 
ment. This type of cross validation requires more compu¬ 
tation than the cross validation based on just the Lasso 
estimator since it needs to compute the OLS estimator 
for each fold and each given Aa (or Ab), but it can give 
better prediction and model selection performance. 

Figure 2 presents the ATE estimates along with 95% 
confidence intervals (Cl). The interval lengths are shown 
on top of each interval bar. All the methods give confi¬ 
dence intervals containing 0; hence, this experiment failed 
to provide sufficient evidence to reject the hypothesis that 
PAC did not have an effect on patient QALYs (either 
positive or negative). Since the caretakers of patients 
managed without PAC retained the option of using less 
invasive cardiac output monitoring devices, such an effect 
may have been particularly hard to detect in this experi¬ 
ment. 

However, it is interesting to note that (see Table 1), 
compared with the unadjusted estimator, the OLS ad¬ 
justed estimator causes the ATE estimate to decrease 
(from -0.13 to -0.31), and shortens the confidence interval 
by about 20%. This is due mainly to the imbalance in the 
pre-treatment probability of death, which was highly pre¬ 
dictive of the post-treatment QALYs. The cv(Lasso) ad¬ 
justed estimator yields a comparable ATE estimate and 


ATALasso-t-OLS — 



Unadjusted OLS cv(Lasso) cv(Lasso+OLS) 
method 


Figure 2: ATE estimates (red circles) and 95% confi¬ 
dence intervals (bars) for the PAC data. The numbers 
above each bar are the corresponding interval lengths. 


confidence interval, but the fitted model is more inter¬ 
pretable and parsimonious than the OLS model: it selects 
24 and 8 covariates for treated and control, respectively. 
The cv(Lasso-l-OLS) estimator selects even fewer covari¬ 
ates: 4 and 5 for treated and control, respectively, but 
performs a similar adjustment as the cv(Lasso) (see the 
comparison of fitted values in Figure 12). We also note 
that these adjustments agree with the one performed in 
[13], where the treatment effect was adjusted downwards 
to —0.27 after stratifying into 4 groups based on predicted 
probability of death. 

The covariates selected by Lasso for adjustment are 
shown in Table 2, where “A-A” denote quadratic term 
of the covariate A and “A:B” denote two way interac¬ 
tion between two covariates A and B. Among them, pa¬ 
tient’s age and estimated probability of death (p_death), 
together with the quadratic term “age-age” and interac¬ 
tions “age:p_death” and “p_death:mech_vent"*’”, are the 
most important covariates for the adjustment. The pa¬ 
tients in control group are slightly older and have slightly 
higher risk of death. These covariates are important pre¬ 
dictors of the outcome. Therefore, the unadjusted esti¬ 
mator may overestimate the benefits of receiving PAC. 

Since not all the potential outcomes are observed, we 
cannot know the true gains of adjustment methods. How¬ 
ever, we can estimate the gains via building a simulated 
set of potential outcomes by matching treated units to 
control units on observed covariates. We use the match- 

mechanical ventilation at admission 
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Table 1: Statistics for the PAC illustration 


No. of selected covariates 


Methods 

ATE 

CTATE 

95% confidence interval 

treated 

control 

Unadjusted 

-0.13 

0.081 

[-0.69,0.43] 

- 

- 

OLS 

-0.31 

0.054 

[-0.77,0.14] 

- 

- 

cv(Lasso) 

-0.33 

0.052 

[-0.77,0.12] 

24 

8 

cv(Lasso-l-OLS) 

-0.36 

0.053 

[-0.82,0.09] 

4 

5 


Table 2: Selected covariates for adjustment 


method 

treatment 

covariates 

cv(Lasso-l-OLS) 

treated 

age, p-death, age-age, age:p death 

cv(Lasso-l-OLS) 

control 

age, p death, age-age, age:p death, p death:mech vent 


cv(Lasso) treated pac_rate, age, p_death, age-age, p_death-p_death, region:iin_score, regionisystemnew, 


pac_rate:age, pac_rate:p_death, pac_rate:systemnew, iin_score:interactnew, age:p_death, 
age:glasgow, age:systemnew, interactnew:systemnew, pac_rate:creatinine, 
age:mech_vent, age:respiratory, agexreatinine, interactnew:mech_vent, 
interactnewimale, glasgow:organ_failure, p_death:mech_vent, systemnewimale 
cv(Lasso) control age, p_death, age-age, unitsize:p_death, pac_rate:systeninew, age:p_death, 

interactnew:mech_vent, p_death:mech_vent 

Covariate definitions: age (patient’s age); p_death (baseline probability of death); mech.vent (mechanical ventilation 
at admission); region (geographic region); pac_rate (PAC rate in unit); creatinine, respiratory, glasgow, interactnew, 
organJailure, systemnew, im_score (various physiological indicators). 


ing method described in [21] which gives 1013 observa¬ 
tions with all potential outcomes imputed. We match on 
the 59 main effects only. The ATE is —0.29. We then use 
this synthetic data set to calculate the biases, standard 
deviations (SD) and root-mean square errors (\/MSE) 
of different ATE estimators based on 25000 replicates 
of completely randomized experiment which assigns 506 
subjects to the treated group and the remainders to the 
control group. 

Table 3 shows the results. For all the methods, the 
bias is substantially smaller (by a factor of 100) than 
the SD. The SD and VMSE of the OLS adjusted esti¬ 
mator are both 10.2% smaller than those of the unad¬ 
justed estimator, while the cv(Lasso) and cv(Lasso-l-OLS) 
adjusted estimators further improve the SD and a/MSE 
of the OLS adjusted estimator by approximately 4.7%. 
Moreover, all these methods provide conservative confi¬ 
dence intervals with coverage probabilities higher than 
99%. However, the interval lengths of the OLS, cv(Lasso) 
and cv(Lasso-l-OLS) adjusted estimator are comparable 
and are approximately 10% shorter than that of the un¬ 
adjusted estimator. The cv(Lasso-l-OLS) adjusted esti¬ 
mator is similar to the cv(Lasso) adjusted estimator in 
terms of mean squared error, confidence interval length 
and coverage probability, but outperforms the latter with 


Treated 



Figure 3: Selection stability comparison of cv(Lasso) 
and cv(Lasso-l-OLS) for treatment group. 
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Control 



Figure 4: Selection stability comparison of cv(Lasso) 
and cv(Lasso+OLS) for control group. 

much fewer and more stable covariates in the adjustment 
(see Figures 3 and 4 for the selection frequency of each 
covariate for treatment group and control group respec¬ 
tively). We also show in Figure 14 that the sampling 
distribution of the estimates is very close to Normal. 

We conduct additional simulation studies to evaluate 
the finite sample performance of ATEi^^^so and compare 
it with that of the OLS adjusted estimator and the un¬ 
adjusted estimator. A qualitative analysis of these sim¬ 
ulations yields the same conclusions as presented above; 
however, for the sake of brevity, we defer the simulation 
details in the supporting information. 

8. Discussion 

We study the Lasso adjusted average treatment effect 
(ATE) estimate under the Neyman-Rubin model for ran¬ 
domization. Our purpose in using the Neyman-Rubin 
model was to investigate the performance of the Lasso 
under a realistic sampling framework which does not im¬ 
pose strong assumptions on the data. We provide con¬ 
ditions that ensure asymptotic normality, and provide a 
Neyman-type estimate of the asymptotic variance which 
can be used to construct a conservative confidence in¬ 
terval for the ATE. While we do not require an explicit 
generative linear model to hold, our theoretical analysis 
requires the existence of latent ‘adjustment vectors’ such 
that moment conditions of the error terms are satisfied, 
and that the cone invertibility condition of the sample 


covariance matrix is satisfied in addition to moment con¬ 
ditions for OLS adjustment as in [7]. Both assumptions 
are difficult to check in practice. In our theory, we do not 
address whether these assumptions are necessary for our 
results to hold, though simulations indicate that the mo¬ 
ment conditions cannot be substantially weakened. As 
a by-product of our analysis, we extend Massart’s con¬ 
centration inequality for sampling without replacement, 
which is useful for theoretical analysis under the Neyman- 
Rubin model. Simulation studies and the real data illus¬ 
tration show the advantage of the Lasso-adjusted estima¬ 
tor in terms of estimation accuracy and model interpre¬ 
tation. In practice, we recommend a variant of Lasso, 
cv(Lasso-l-OLS), to select covariates and perform the ad¬ 
justment, since it gives similar coverage probability and 
confidence interval length when compared with cv(Lasso), 
but with far fewer covariates selected. In future work, we 
plan to extend our analysis to other popular methods in 
high-dimensional statistics such as Elastic-Net and ridge 
regression, which may be more appropriate for estimating 
adjusted ATE under different assumptions. 

The main goal of using Lasso in this paper is to re¬ 
duce the variance (and overall mean squared error) of 
ATE estimation. Another important task is to estimate 
heterogenous treatment effects and provide conditional 
treatment effect estimates for subpopulations. When the 
Lasso models of treatment and control outcomes are dif¬ 
ferent, both in variables selected and coefficient values, 
this could be interpreted as modeling treatment effect 
heterogeneity in terms of covariates. However, reducing 
variance of the ATE estimate and estimating heteroge¬ 
nous treatment effects have completely different targets. 
Targeting heterogenous treatment effects may result in 
more variable ATE estimates. Moreover, our simulations 
show that the set of covariates selected by the Lasso is 
unstable and this may cause problems when interpreting 
them as evidence of heterogenous treatment effects. How 
best to estimate such effects is an open question that we 
would like to study in future research. 

9. Materials and Methods 

We did not conduct the PAC-man experiment, and we 
are analyzing secondary data without any personal iden¬ 
tifying information. As such, this study is exempt from 
human subjects review. The original experiments under¬ 
went human subjects review in the UK [2]. 
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Table 3: Statistics for the PAC synthetic data set 


No. of selected covariates 



Bias 

SD 


Coverage (%) 

Length 

treated 

control 

Unadjusted 

0.001(0) 

0.20(0.02) 

0.20(0.02) 

99 

1.06 

- 

- 

OLS 

0.002(0) 

0.18(0.02) 

0.18(0.02) 

99 

0.95 

- 

- 

cv(Lasso) 

0.001(0) 

0.17(0.02) 

0.17(0.02) 

99 

0.94 

25(23) 

15(14) 

cv(Lasso-l-OLS) 

0.000(0) 

0.17(0.02) 

0.17(0.02) 

99 

0.95 

6(6) 

4(3) 


The numbers in parentheses are the corresponding standard errors estimated by using the bootstrap with B = 500 
resamplings of the ATE estimates. 
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A. Simulation 


where ts denotes the t distribution with three degrees of 
freedom. This ensures that the treatment effects are not 
not constant across individuals, and that the linear model 
does not hold in this simulation. The error terms and 
are generated according to the following linear model 
with some hidden covariates z, : 


In this section we carry out simulation studies to evaluate 
the finite sample performance of ATALasso estimator. We 
also present results for the ATEohS estimator when p < n 
and the two-step estimator ATALasso-i-OLS- 

We use the R package “glmnet” to compute the Lasso 
solution path. We select the tuning parameters Aa 
and Xb by 10-fold Cross Validation (CV) and denote 
the corresponding adjusted estimators as cv(Lasso) and 
cv(Lasso-l-OLS) respectively. We should mention that 
for the cv(Lasso-l-OLS) adjusted estimator, we compute 
the CV error for a given value of the Aa (or Xb) based 


Jo-) 




(al) I A“) 


1=1 




1=1 


where and are drawn independently from stan¬ 
dard normal distribution. The vector z.^ is independent 
of Xi and also drawn independently from the multivari¬ 
ate normal distribution A/"(0, E). The values of Xj, 
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Zi, e-“\ Gi and bi are generated 
once and then kept fixed. 

After the potential outcomes are generated, a com¬ 
pletely randomized experiment is simulated 25000 times, 
assigning ha = 100,125,150 subjects to treatment A and 
the remainder to control B. There are 12 different combi¬ 
nations of (PjPjUa) in total. 

Figures 8, 9, 10 show the boxplot of different ATE es¬ 
timators with their standard deviations (computed from 
25000 replicates of randomized experiments) presented on 
top of each box. Regardless of whether the design is bal¬ 
anced {ha = 125) or not {ua = 100,150), the regression 
based estimators have much smaller variances and than 
that of the unadjusted estimator and therefore improve 
the estimation precision. 

To further compare the performance of these estima¬ 
tors, we present the bias, the standard deviation (SD) 
and the root-mean square error (v/MSE) of the estimates 
in Table 4. Bias is reported as the absolute difference 
from the true treatment effect. We find that the bias of 
each method is substantially smaller (more than 10 times 
smaller) than the SD. The cv(Lasso) and cv(Lasso-l-OLS) 
adjusted estimators perform similar in terms of SD and 
VMSE: reducing those of the OLS adjusted estima¬ 
tor and the unadjusted estimator by 10% — 15% and 
15% — 31% respectively. We also compare the number 
of selected covariates by cv(Lasso) and cv(Lasso-l-OLS) 
for treatment group and control group separately, see Ta¬ 
ble 5. It is easy to see that the cv(Lasso-l-OLS) adjusted 
estimator uses many fewer (more than 44%) covariates 
in the adjustment to obtain similar improvement of SD 
and VMSE of ATE estimate as the cv(Lasso) adjusted 
estimator. Moreover, we find that the covariates selected 
by the cv(Lasso-l-OLS) are more stable across different 
realizations of treatment assignment than the covariates 
selected by the cv(Lasso). Overall, the cv(Lasso-l-OLS) 
adjusted, the cv(Lasso) adjusted, the OLS adjusted and 
the unadjusted estimators perform from best to worst. 

We move now to study the finite sample performance 
of Neyman-type conservative variance estimates. For 
each simulation example and each one of the 25000 com¬ 
pletely randomized experiments, we calculate the ATE 
estimates (ATE) and the Neyman variance estimates (ct) 
and then form the 95% confidence intervals \ATE— 1.96- 
uj-Jn^ATE T 1.96 • Figures 5, 6, 7 present the 

boxplot of the interval length with the coverage proba¬ 
bility noted on top of each box for the unadjusted, OLS 
adjusted (only computed when p = 50), cv(Lasso) ad¬ 
justed and cv(Lasso-l-OLS) adjusted estimators. More 
results are showed in Table 6. We find that all the con¬ 
fidence intervals for the unadjusted estimator are conser¬ 
vative. The cv(Lasso) adjusted and the cv(Lasso-|-OLS0 
adjusted estimators perform very similar: although their 


coverage probability (at least 92%) may be slightly less 
than the pre-assigned confidence level (95%), their mean 
interval length is much shorter (26% — 37%) than that 
of the unadjusted estimator. The OLS adjusted esti¬ 
mator has comparable interval length with the cv(Lasso) 
and cv(Lasso-l-OLS) adjusted estimator, but has slightly 
worse coverage probability (90% — 93%). 

To further investigate how good the Neyman standard 
deviation (SD) estimate is, we compare them in Figure 11 
with the “true” SD presented in Table 4 (the SD of the 
ATE estimates over 25000 randomized experiments). We 
find that Neyman SD estimate is very conservative for the 
unadjusted estimator (its mean is 5% — 14% larger than 
the “true” SD); while for the OLS adjusted estimator, the 
mean of Neyman SD estimate can be 6% — 100% smaller 
than the “true” SD which may be because of over-fitting. 
For the cv(Lasso) and cv(Lasso-l-OLS) adjusted estima¬ 
tor, the mean of Neyman SD estimator is within 1 ± 7% of 
the “true” SD. Although the Neyman variance estimate 
is asymptotically conservative, the finite sample behavior 
of the Neyman SD estimate can be progressive for the 
regression-based adjusted estimator. However, if we in¬ 
crease the sample size n to 1000, almost all the confidence 
intervals are conservative. 

We conduct more simulation examples to evaluate the 
conditions assumed for asymptotic normality of the Lasso 
adjusted estimator. We use the same simulation setup as 
above, but for simplicity, we generate the potential out¬ 
comes from linear model (set = ^(^ 2 ) _ 

move the effects of the hidden covariates Zi in generating 
the error terms and and set p = 0, ua = 125. 
We find that the distribution of the cv(Lasso) adjusted 
estimator may be non-normal when: 

(1) . The covariates are generated from Gaussian distri¬ 

bution and the error terms do not satisfy second mo¬ 
ment condition, e.g., being generated from t distri¬ 
bution with one degree of freedom, see the upper two 
subplots of Figure 1 (in the main text) for the his¬ 
tograms of unadjusted the cv(Lasso) adjusted esti¬ 
mators (the corresponding p-values of Kolmogorov- 
Smirnov testing for normality are less than 2.2e—16). 

(2) . The covariates do not have bounded fourth moments, 

e.g., being generated from t distribution with three 
degrees of freedom, see the lower two subplots of Fig¬ 
ure 1 (in the main text) for the histograms of unad¬ 
justed the cv(Lasso) adjusted estimators (again, the 
corresponding p-values of Kolmogorov-Smirnov test¬ 
ing for normality are less than 2.2e — 16). 

These findings indicate that our moment condition (Con¬ 
dition 2 and Remark 1) cannot be dramatically weak¬ 
ened. However, we also find that the cv(Lasso) adjusted 
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estimator still has smaller SD and VMSE than the unad¬ 
justed estimator even when these moment conditions do 
not hold. 


B. The design matrix of the PAC 
data 

In the PAC data, there are 59 covariates (main effects) 
including 50 indicators which may be correlated with the 
outcomes. One of the main effects (called interactnew) 
has heavy tail, so we do the transform: x —>■ log(|a;| -I- 1) 
to make it look like normal distributed. We then cen¬ 
tralize and standardize the non-indicator covariates. The 
quadratic terms (9 in total) of non-indicator covariates 
and two-way interactions between main effects (1711 
in total) may also contribute to predict the potential 
outcomes, so we included them in the design matrix. 
The quadratic terms and the interactions between non¬ 
indicator covariates and the interactions between indica¬ 
tor covariates and non-indicator covariates are also cen¬ 
tered and standardized. Some of the interactions are ex¬ 
actly the same as other effects and we only retain one of 
them. We also remove the interactions which are highly 
correlated (with correlation larger than 0.95) with the 
main effects and remove the indicators with very sparse 
entries (where the number of I’s is less than 20). Finally, 
we form a design matrix X with 1172 columns (covari¬ 
ates) and 1013 rows (subjects). 


C. Estimation of constants in the 
conditions 

Let = {j ■■ 13^'' ^ 0} and = {j : I3f ^ 0} 
denote the sets of relevant covariates for treatment group 
and control group respectively. Denote S = [J = 
{j : 0 or 0}. We use bootstrap to get an 

estimation of the relevant covariates sets and 

then the approximation errors and are estimated 
by regressing the observed potential outcomes a and b on 
the covariates in S respectively. We only present how to 
estimate 5'^“^ and in detail and the estimation of 
and are similar. 

Let A, B be the set of treated subjects (using 
PAC) and control subjects (without using PAC) re¬ 
spectively. Denote a^,* S A the potential outcomes 
(quality-adjusted life years (QALYs)) under treatment 
and Xi G the covariates vector of the iih sub¬ 

ject. For each d = 1,..., 1000, we draw a bootstrap sam¬ 
ple {{a*{d),x*{d)) : i G A} with replacement from the 
data points {{ai,Xi) : i G A}. Then computing the Las- 
soOLS(CV) adjusted vector ^{d) based on each bootstrap 


sample {{a*{d),x*{d)) : i G A}. Let tj be the selection 
fraction of non-zero /3y(d) in the 1000 bootstrap estima¬ 
tors, i.e., Tj = (1/1000) where I is the 

indicator function. We form the relevant covariates 5'^“^ 
by the covariates whose selection fraction are larger than 
0.5: = {j : Tj > 0.5}. 

To estimate the approximation error e^°‘\ we regress 
on the relevant covariates Xij,j G and compute OLS 
estimate and the corresponding residual. That is, let 
denote the complement set of 

/3ols = argmin V (oi - aa - (xi - xa)'^/3') . 

P: a.-=o. vjerM 2nA ^ V / 

e-“^ =ai-dA- (xi - xa)^^olS’ * ^ 

The maximal covariance 5n is estimated as: 


max 


max 


UA 


E( 

i^A 


Xij 


max 

3 





) 



D. Proofs of Theorem 1, 2, 3 and 
Corollary 1 


In this section, we will prove Theorem 1-3 and Corollary 
1 under weaker sparsity conditions. 


Definition 3 We define an approximate sparsity mea¬ 
sure. Given the regularization parameter XaAb and 
and , the sparsity measures for treatment and control 
groups, and s^^^ are defined as 


(a) . 

s\ = y mm 



o(^) 


p 

min 

1=1 



(24) 

respectively. We will allow and s^^^ to grow with n, 
though the notation does not explicitly show this. Note 
that this is weaker than strict sparsity, as it allows [3^°''^ 
and to have many small non-zero entries. 


Condition (*). Suppose there exist f3^^\ Xa and 
Xb such that the conditions 1, 2, 3 and the following state¬ 
ments 1, 2, 3 hold simultaneously. 


• Statement 1. Decay and scaling. Let s\ = 

f (a) {b)\ 
max|s),/,s)^/j. 


(sAVlogp) ' 

(25) 

(sAlogp)/\/n = o(l). 

(26) 
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• Statement 2. Cone invertibility factor. Define the 
Gram matrix as E = ~ x)(xi — x)^: 

There exist constants C > 0 and ^ > 1 not depending 
on n, such that 

llhslli <CsA||Sh||o,, VheC, (27) 

with C = {h: ||hsc||i < ^||hs||i}, and 

^ = {j:|/3]“^l>AaOr|/3f |>A4. (28) 


• Statement 3. Let r = min {l/70, (3p^)^/70, (3— 
3p^)^/70}. For constants 0 < r] < and 0 < 
M < oo, assume the regularization parameters of 
the Lasso belong to the sets 


Aq € (—, M] X 
V 


2(1+ r)L^/^ 12 \ogp 


PA 


, (29) 


Xb G ( —, M] X 

V 


2(1 + 12 logp 


PB 


■ (30) 


It is easy to verify that Condition (*) is implied by 
conditions 1-6. In the following, we will prove Theorem 
1-3 and Corollary 1 under the weaker Condition (*). 

'■ (®) 

For ease of notation, we will omit the subscript of /Ql^sso) 

/^Lassoi '5 a, and from now on. Moreover, we can 
assume, without loss of generality, that 


a = 0, & = 0, X = 0. (31) 

Otherwise, we can consider at = at — a, bi = bi — b and 
Xi = Xi — X. Then, ATE = a — b = 0 and the definition 
of ATEi^asso becomes 


ATi?Lasso = 


QA - ( xa ) P 


bs - (xb)'^^^^^ 


.(32) 


We will rely heavily on the following Massart concen¬ 
tration inequality for sampling without replacement. 


Lemma 1 Let {zi,i = l,...,n} be a finite population of 
real numbers. Let Ac {i,...,n} be a subset of deter- 
ministie size |A| = ua that is seleeted randomly without 
replaeement. DefinepA = riAln, = n~^ ■ 

Then, for any t > 0, 

P{zA-z>t)<e^pi^-^^^^Y (33) 

with T = min {l/70, (3 pa)^/70, (3 — 3pa)^/70}. 

Remark. Massart showed in his paper [26] that for sam¬ 
pling without replacement, the following concentration in¬ 
equality holds: 

P {zA - z > t) < exp I • 


His proof required that n/uA must be an integer. We 
extend the proof to allow n/uA to he a non-integer but 
with a slightly larger constant factor (1 -|- r)^. 


Proof. Assume z = 0 without loss of generality. For ua < 
n/2, let m > 2 and r > 0 be integers satisfying n—UAm = 
r < UA. Let u > 0. We first prove that 



<Aexp Zi/{m{m -\-1)}-\-u'^na'^[4 


ieB 


(34) 


for a random subset B C {1,..., n} of fixed size \B\ < n/2 
and a certain fixed S G {—1,1}. Let Pi be the proba¬ 
bility under which {ii,... ,z„} is a random permutation 
of (1,..., n}. Given {*i,..., f„}, we divide the sequence 
into consecutive blocks Bi,..., Bnj^ with || = m-|-1 for 
j = 1,..., r and \Bj\ = m for j = r -I- 1,..., u-a■ Let Zk be 
the mean of {zi : i G B^} and P 2 be a probability condi¬ 
tionally on {ii,... ,in} under which Wk is a random ele¬ 
ment of {zi'.iG Bk}, k = 1,.. .,nA. Then (wi,..., Wua} 
is a random sample from {zi,...,z„} without replace¬ 
ment under P = P 1 P 2 . Let Ak = max^gBs, Zi — miiii^Bk 
and denote E 2 the expectation under P 2 . The Hoeffding 
inequality gives 


E 2 exp j u Wk ] < exp j u Zk + (uV8) EA 




k^l 


k^l 


As Af < 2.'Y[ji^BSZi 


(35) 


E 2 exp [ u Wk J < exp j u Zk /4 J (36) 


k=l 


k=l 


Let B = As z = 0, 


E ^ E {'mfm -G 1 )}. 


(37) 


k—1 


This yields 34 with (5 = 1 when \B\ < n/2. Otherwise, 
34 holds with d = — 1 when B is replaced by B^, as 
E*g B^i = - E*gb-= due to 2 = 0. 

Now, as m(m -b 1) > 6, repeated application of 34 
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yields 



< E exp 


u6' ^ Zi/{m{m + l)m'{m' + 1)} 


ieB' 


+ (l + {m(m + 1)}”^) w^ncr^/4] 

< exp [(l + {m{m + 1)}“^(1 + 1/36 + 1/36^+ 
• • •)) M^ncr^/4] 

= exp [(l + (36/35){m(TO + 1)}”^) u^nCT^/4] 


< exp 


(1 + t)^ u^ncr^/4 


(38) 


with T = (18/35){m(TO + 1)}“^. The upper bound for r 
follows from 2 < m < nj ua < m + 1. 

As z = 0, we also have 


if exp j u EH < exp (1 + r)^ u^na"^ jA 


ieA 


for UA > u/2. This yields 33 via the usual 
P {ZA - z >t} 

< exp [—ut + (1 + T)^u^nCT^/(4n^)] 


= exp 


^ PAnAt^ _l_ PAnAt'^ 


(1 + r)2(T2 (1 + r)2cr2 

with u = 2pAnAt/ {cr(l + r)}^. 


(39) 


(40) 


D.l. Proof of Theorem 1 

Proof. Recall the decompositions of the potential out¬ 
comes: 


а, = a + (x, - x)^/3(“) + 6^“^ = xf + 6^“^, (41) 

б, = 5 + (x, - 5cf(3^'^'> + = xf+ ef^. (42) 

If we define ^ = (3^ by 

substitution, we have 


Vri(ATifLasso - ate) 


= \/n 




— e 


B 


- y/n (xa)^ - (xb)^ 


of the terms. We will focus on the term involving the 
treatment group A, but exact same analysis is applied to 
the control group B. We have the bound 


(x^)^h(“) <||x^|U||h(° 


(43) 


We will bound the two terms on the right hand side of 
43 by the following Lemma 2 and Lemma 3, respectively. 


Lemma 2 Under the moment condition of [6], if we let 
_ {i+t)l — / 2 iogp ihgfi as n ^ oo, 

PA \j n ^ ^ 

-P(I|xa|Ioo > Cn) ^ 0. 


Thus, IIxaIL = Op • 


Lemma 3 Assume the conditions of Theorem 1 hold. 
Then ||hO)||i = Op . 

The proofs of these two Lemmas are below. Using 
these two Lemmas, it is easy to show that (**)= y/n ■ 

Ov{\r^)»v(7kt)=°AB 

D.2. Proof of Corollary 1 


Proof. By Theorem 1 in [11], the asymptotic variance of 
y/n ATUunadj is lim„_j.oo cr^ + T^w-lim„ 

2 lim„_,,oo aab, so the difference is 


CT? + 


1 Pa ,. f 2 2\ I Pa i- /" 2 2\ 

hm (cr („) - cr„) -b -- hm (cr („) - 

n—^oo ' — Pa n—KX) ' ^ ' 


PA 

+ 2 lim (crg(a)g(i,) - (Tab) ■ 

n—3-oc 


We will analyze these three terms separately. Since 
and Xj3^^'^ are the orthogonal projections of a and b onto 
the same subspace, we have 

(A:/30))'re(“) = (x/30))re(6) 

= (A:,3(^))^eO) = = 0. 

Simple calculations yield 


2 \\Ua)\\2 

e(«) -^a = l|e^ II 2 - 


= -||X/3(“)||2, 


We will analyze these two terms separately, showing 
that (*) is asymptotically normal with mean 0 and vari¬ 
ance given by 17, and that (**) is Op (1). 

Asymptotic normality of (*) follows from the Theorem 
1 in [11] with a and b replaced by e^A and respectively. 
To bound (**), we will apply Holder inequality to each 


-^b =l|e^'^ll2-|l&ll2 = -|l^/3^'^ll2, 

CTeWeW-^ab = {c^^Y {e^^^ ) - b = - {X {X ). 

Combining the above three equalities, we obtain the corol¬ 
lary. 
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D.3. Proof of Theorem 2 


By definition 21 and simple calculations, 


Proof. To prove Theorem 2, it is enough to show that 




lim 


2 


(44) 


;^2 


lim 


^2 

(Jeib) • 


(45) 


We will only prove the statement 44 and omit the proof 
of the statement 45 since it is very similar. 

We first state the following two lemmas. Lemma 4 
bounds the number of selected covariates (covariates with 
a nonzero coefficient), while Lemma 5 establishes condi¬ 
tions under which the subsample mean (without replace¬ 
ment) converges in probability to the population mean. 


Lemma 4 Under the conditions in Theorem 2, there ex¬ 
ists a constant C, such that the following holds with prob¬ 
ability going to 1: 


e(“) 


aA- (Xi - 


HA - df^ 


i£A 


1 


UA - df( 


{Xi - Xa)^/3^“^ 


ieA 

-b(xi - - ,9^“^ 

1 


UA - df^ 




ieA 

+ (x,-xa)^(^^“^-^ 
nA 1 
UA - df'^°-') UA 


X + (x* - Xa)’ 


(/3(“) _ 


iGA 
2 


UA 


UA - df^^'> 1 UA 


5 ;^i:( 


(a) (a) 

^ A 


i^A 


' +—X 

^^7tA 


((x*- xa)^(/3^“^ I -b 


UA 


nA - dfi<^) 


. iGA J 


< Cs- S^^'> < Cs. 


(46) The second to last equality is due to the decomposition 
of potential outcome a: 


The proof of Lemma 4 can be found below. 


Oi = xjl3^°'^ -+- a A = {xAf 




Lemma 5 Let {zi,i = l,...,n} be a finite population of 
real numbers. Let Ad {i,...,n} be a subset of deter¬ 
ministic size |yl| = UA that is selected randomly without 
replacement. Suppose that the population mean of the Zi 
has a finite limit and that there exist constants e > 0 and 
L < oo such that 


It is easy to see that 


E (4“’ -»!.“’)' = ^ - (U'f- m 

i^A iGA 


By the 4th moment condition on the approximation error 
6 ^“) (see 7), and applying Lemma 5 we get 


1 

n 


Ei*.r‘ 


< L. 


(47) 


±y-,,(<.))4 4 


lim 


nA 




64 “^ ^ lim = 0 . 


iGA 


If^^PA^ (0,1), then 


Therefore, 


ZA lim z. 

n—^oo 


(48) 


1 

UA 


E(4 

iGA 



P, r 2 

-J- lim (T„(a). 

n—¥oo 


(50) 
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Simple algebra operations give 


true asymptotic variance: 


1 , p,(a) _ 


HA 


^ ((x, 


iGA 


— y^(Xi - XA)(Xi - Xa) 

.^^TtA 




< • II—-XA)(Xi - Xa) 


1 


\T\ 


(51) 


We next show that 51 converges to 0 in probability. By 
Lemma 3 and Lemma 7, we have 


||/3(“)-3^“^||i = ||h(“)||i = 


1 


Viogpy ’ 


(52) 


|— y^(xi - XA)(xi - xa)’^||oo = Op(l). (53) 

^^ttA 


Therefore, 




0 . 


(54) 


By Cauchy-Schwarz inequality, 

7^ 4 • * 


UA 


ieA 


< 




ieA 


1 

UA 


ieA 


X:((x.-sAq/3'“>-y’ 


(55) 


which converges to 0 in probability because of 50 and 54. 
By Lemma 4 and Condition 4, we have 


UA 


UA 


1 . 


UA — UA — — 1 

Combining 50, 54, 55 and 56, we conclude that 


(56) 


CT („) ^ lim . 

^ n—^oo ^ 


The remaining part of the proof is to study the differ¬ 
ence between the conservative variance estimate and the 


— lim CTp(„) -I- -—— lim aVt) 
Pa ^ 1 — n^oo 


1 -PA 
PA 


lim a + - - lim +2 lim fTe(<.)e(i.) 

n—>oo 1 — n—^oo n—^oo 

= lim CTa„) -I- lim al^b) -2 lim (TeWeW 

n—^oo n—^oo n—¥oo 


= lim <(“)-e('>) 


= lim - ^ (ci - 6i - xf (/3l“l - 

n—>-oo n \ 




(57) 


D.4. Proof of Theorem 3 


Proof. By Lemma 4, max (sl“l, s^^l) = Op(min (ua, n-s)). 
Therefore, (ct^(„) , ) and have the 

same limits. The conclusion follows from Theorem 2. 


E. Proofs of Lemmas 

In this section, we will drop the superscript on h, e and 
(3 and focus on the proof for treatment group A, as the 
same analysis can be applied to control group B. 

E.l. Proof of Lemma 2 

Proof. Let c„ = *'^~*~p^^ ^ union bound, 

1 


^(IIxaIIoo >Cn)=P[ max 


UA 


i^A 


> Cn 


^E^ 

i=i 


-E^ 

UA 

ieA 


(58) 


Crj 


By Cauchy-Schwarz inequality, we have 


1 ” ^ /I ” 

^ E „ E ■ 


n 


(59) 


Substituting the concentration inequality 33 into 58, 


^(IIxaIIoo > c») < 2expqogp- 


PAriAcj \ 
(l-hr)2Li/2 j 

= 2 exp {— logp} —!► 0. 

E.2. Proof of Lemma 3 

Proof. We start with the KKT condition, which charac¬ 
terizes the solution to the Lasso. Recall the definition of 
the Lasso estimator (3: 

^ = argmin,:^y' (oi - oa - (x^ - XA)^/3)^-bAa ||^||i • 
(3 2nA ^ ^ 


ieA 
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The KKT condition for j3 is 
1 

UA 


^(xi - x^) - aA- (xi - xa)^/ 9) = AaK, (60) 

i^A 

where k is the subgradient of 11/3| |i taking value at /3 = /9, 
i.e., 


K e d\\P\\i 


13=13 


with 


G [—1,1] for j s.t. Pj = 0 


Kj = sign(/3j) otherwise 


(61) 


Substituting by the decomposition 3, 60 becomes 
1 


riA 


^(x, - xa)(x, - xa)'^(/3 - 0 ) 


i£A 


- y^(Xi - XA)(ej - Ca) = XaK. 

n ^ ^ 


(62) 


UA 


i^A 


Multiplying both sides of 62 by —= (/3 — ^)^, we 
have 

— ^ ((xj - XA)^h)^ - h^— ^(xi - XA)(ei - ga ) 

= \a{P-hf^<\a (||/3|li-||,9||i 

where the last inequality holds because 

< ||/3||i||«:||oo < ||/3||i and k = \\P\\i. 
Rearranging, and applying Holder’s inequality, we have 
1 


— ^((x,-XA)^h)^ 


UA 
< -a 


ieA 


< A„ (||/3|li - ||,9||i) + - XA)(ei - ca) 


< Aa 


-II/3II: 


ieA 


1 


— y^(x* - XA)(ei - ca) 


To control the term (*), we define the event C = 
{* < rjXa). The following Lemma 6 shows that, with Ao 
defined appropriately, C holds with probability approach¬ 
ing 1. We will prove this Lemma later. 

Lemma 6 Define 

'C= {||;^EiGA(x*-XA)(ei-eA) ^ < jyAa}- 
Then under the conditions of Theorem 1, P{C) -A 1. 


On£ 

1 

nA 


^ ((x, - xa)"^!!)^ < Aa (ll^lli - ||/3||i + 11 ||h||i) . 

(63) 


i^A 


By substituting the defiition of h, and several applications 
of the triangle inequality, we have 

Il^|li-||^||i<||hs||i-||h 5 c||i +2 11/35.11,. 

Therefore, 

1 


UA 


((x, -XA)^h)' 


iGA 


<Aa(||h5||,-||hs=||, +2 11 / 35 . 11 ,+7? ||h||,) 

< Aa ((?7 — 1) ||hsc||, + (1 + 77 ) ||h5||, + 2 II/ 35 .II,). 

Because Z^^gA ((^* ~ XA)^h)^ > 0, we obtain 

(1 - v) l|hs=|li 

< (1 + 77 ) ||hs||, + 2 II/ 35 .II, < (1 + 77 ) ||hs||, + 2 sAa. 

(64) 

where the last inequality holds because of the definition 
of s in 24 and S in 28. 

Consider the following two cases: 

(I) If (l + 77)||hs||i + 2sAa > (1- 77)^11 hsili then by 64, 


|h||, = ||h5||, + ||h5c||, 
1 + 77 


< 


< 


1-77 
2sA. 


I)l|h5||i + 

2 


1-77 \{l - T])^ - {I + r]) 


2sXa 

1-77 

+ 1 


By the definition of Aa and the scaling assumptions 25, 
26, we have that sXa = o ^ ^logp ) ■ 

(II) If (I + ?7)||hs||i + 2sAa < (I-77)^||hs||i then by 64 
we have ||h 5 c||, < ^||hs||i. Applying the cone invertibil- 
ity condition on the design matrix 27, 


|h||, = ||h5||, + ||h5=lli 

<(l+C)||hs||i<(l+OCs 


-A^Ah 

n 


(65) 


Before applying this inequality we will revisit the KKT 
condition 61, but this time we will take the Igo-norm, 
yielding 


— y^(Xi - XA)(Xi - XA)^h 

71 A 


iGA 

< Aa + 


— y^(xi - XA)(ej - ca) 


iGA 


< (1 + r])Xa, 

( 66 ) 


where the latter inequality holds on the set C. The final 
step is to control the deviation of the subsampled covari¬ 
ance matrix from the population covariance matrix, so 
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that we can apply 65. We define another event with con¬ 
stant Cl = ^ 


PA 


M =■ 


<Ci 


— y^(xi - x^)(xi - x^)^ - -X'^X 
log pi 




Lemma 7 Assume stability of treatment assignment 
probability eondition 1 and moment condition 6 hold. 
Then P{M) 1. 

We will prove Lemma 7 later. Continuing our inequal¬ 
ities, on the event CAM, 


1 


-X^Xh 


<Cis 


logp 


— y^(Xi - XA)(Xi - XA)’^h 


<o(l) l|h||i + s{l + ri)Xa, 

where we have applied the scaling assumption 26 and 66 
in the second line. Hence, by 65, 

||h||;i < (1 + 0^ [o(l) ||h||^ -I- s(l -I- r])Xa] ■ 

Again, applying the scaling assumptions 25 and 26, we 
get ||h||i =Op(^). 

E.3. Proof of Lemma 4 

Proof. In the proof of Lemma 3, we have shown that, on 
C defined in Lemma 6, 


1 

HA 


+i:((x.-xAq/3-«)’ 

i^A 

<Aa(||/3||i-||3l|l+5?ll/3-;3||i). 

(67) 

<Aa(l + r/)||/3-3||i. 

(68) 

be the j-th column of the design matrix X and 
A^ X)iGA Again, by KKT conditon, we have 

- ^a) if^i - OA - (Xi - xa)^^) 

iGA 

— 


if/3^ ^0. 

Substituting ai by the decomposition 3 yields 


— - XA)(ei - e^) + — - x^) 

ieA 

(Xi - xa)’^(/3 - ^) = Xa. 


Combining with the definition of the event C, we have if 


Aj 


— - ^A)i^i - ^Af{f3 - 

iGA 


> (l-77)Aa. 

(69) 

Let Z = (zi,..., z„) G /jpx" -yirith z^ = x^ — x^ G and 
denote w = Z’^{f3 — $), then 

—iiwa||2 =—X! -^^)^(/3-/3)) 

<Xa(\pp)w-h\v- 

Let = (zi : f G A); since the largest eigenvalues of 
Z'^Za and ZaZ\ are the same, 

\^\Z\Za^a 

n\ 


< ^Ai„ax(^I^A)||wA|| 


< —An>ax(^A^I)Aa(?? +1)11/3 -/3||l 

UA 

+ Aniax-Aa(l + ?7)||^ — ^||l. 

UA 

The last inequality holds because 

AmaxC-^A^J) 

= max ZaZ^u 

u:||u ||2 = l 


max 

u:||u ||2 = l 


^(Xi - XA)(Xi - Xa)’^U 
iGA 

^ Xixf u - nAU^(xA)(xA)^U 

iGA 

< max > xixf u < nAinax- (70) 

u:||u||2 = l 


max 

u:||u ||2 = l 


ieA 


On the other hand. 


^-wJzJZawa = ^ Af > (1 - pfXls. 

A 

•5=1 LA+O 

(71) 

Combining 69, 71 and the fact that with probability go¬ 
ing to 1 (see the proof of Lemma 3) 

||/3-3||i<Cs(l + r,)Aa, 

where C is a constant, we conclude that with probability 
going to 1, 


s < 


11 TL 

Ajnax Aa(l + 77)Cs(l + Tj)Xa 


{1-VrXl ^^^riA 
2 


C(l^ 

- pa(1 - V? ■ 
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E.4. Proof of Lemma 5 

Proof. For any t > 0, we have 
P{\zA— lim z\ > t) < P{\zA—z\ > t/2)+P{\z— lim z\ > t/2). 

n—¥(yo n—>-oo 

(72) 

The second term in the right hand side of 72 obviously 
converges to 0 as n —>■ oo. To bound the first term, we 
apply the concentration inequality 33. By 47, it is easy 
to show 


Let tn = ' / 2 iogp union bound and 

the concentration inequality 33, 

( I I ^ ^2 ^2 ^ ^2 ^2 I I OO ^ I 


< 2 exp \ logp — 


PAUAti 


-21 1 ^ 




1 

< (nL) < LT+'ni+' 

n 

i=l 


(l + r)2L) 

= 2exp{—logp} —>■ 0. 

Taking this back to 74, we have 

P ( II- ^ XiCilloo ^ Iri + I —^1. 

V / 


(76) 


Concentration inequality 33 yields 


For the second term, by Lemma 2, we have shown that, 

1 . 


^ I|X^IL< 


(1 + t)L^/"^ 12 logp 


PA 


P{\zA — z\ > tl‘2-) < 2exp < — 


PAnAt 


4(1 + T)2Li+«ni+' 


0. A similar proof yields 


E.5. Proof of Lemma 6 

Proof. It is easy to verify that 

E (xi - XA)(ei - ca) = — XiCi - (xA)(eA)- 

n A 


P lle-^IL < 


(1 + t)L 1/4 /21ogp 


PA 


Hence, under the scaling condition 26, 


UA 


ieA 


riA 


ieA 


pf||(xA)(e-A)|L< 


(l + r)Li/2 /21ogp 


PA 


n 


1 . 


^1. (77) 


Hence, 




y](xi - XA)(ei - 


Combining 76 and 77 yields 
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CA) 


ieA 

< || — y'xjCil 

riA ^ 


||(kA)(eA)||oo- 


(73) 


P II— - XA)(ei - eA)||oo 

\^^7tA 


< 


2(1 + t)L 1/2 /21ogp 


i^A 


PA 


1 . 


n 


We analyze these two terms on the right hand side of the The conclusion follows from the condition e (^ M] x 

inequality separately. For the first term, by the triangle ( 2 (i+t)l^^^ / 2 iosp j ) 

inequality and the definition of in 9, pa V " "y 


|—Vx,ei| 


(74) 


II 1 1 ^ 

— 11 / ^ ^ 
riA ^ 




i-E 


X,;e,; 


i£A 


2=1 


1 - 1 ^ 

E 11 ^ XjCj ^ I |cso “1“ f^ri- 

zGA 2=1 


(75) 


We will again bound 74 by the concentration inequality 
33 in Lemma 1. By the Cauchy-Schwarz inequality, we 
have for any j = 1, ..,p, 


E.6. Proof of Lemma 7 

Proof. It is easy to see that 

— y'(Xj-XA)(Xi-XA)^ = — V'xixf- (xa)(xa)^. 

^A ttA 

Then, by triangle inequality, 

— y^(xj - xa)(xj - xa)^ --A^A||oo (78) 

) A ( ^ 71 


riA 


i^A 




X,;X,- 


\T\ 


||(xa)(xa) 


ieA 
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E4e? < 


1 


E4) (^E 


< L. 


2=1 


2 = 1 


2=1 


(79) 
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We control the first term (*) again using the con¬ 
centration inequality 33 and the union bound. By the 
wayCauchy-Schwarz inequality, for j, fc = 1, 


n n 


■'^4k] <L. 


2=1 


2 = 1 


2=1 


Then, 




31ogp 


n 


< 2exp|21ogp — 

= 2exp{—logp} —>■ 0. 


SpATiAil + T)'^L\ogp 


(1 -I- T)‘^Lp\n 


( 80 ) 


The second term (**) is bounded by again observing 
that, by Lemma 2 and the scaling condition 26, 


(**) < IIxaIIL = Op( 


logp 


Combining 80 and 81 yields the conclusion. 


( 81 ) 


F. Tables and Figures 
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Boxplot of interval length (95% confidence interval) with coverage probability on top (0^=100) 



Figure 5: Boxplot of the interval length with coverage probability (%) on top of each box for the unadjusted, OLS 
adjusted (only computed when p = 50), cv(Lasso) adjusted and cv(Lasso+OLS) adjusted estimators with ua = 100. 


Boxplot of interval length (95% confidence interval) with coverage probability on top (0^=125) 



Figure 6: Boxplot of the interval length with coverage probability (%) on top of each box for the unadjusted, OLS 
adjusted (only computed when p = 50), cv(Lasso) adjusted and cv(Lasso+OLS) adjusted estimators with ua = 125. 
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Boxplot of interval length (95% confidence interval) with coverage probability on top (0^=150) 



Figure 7: Boxplot of the interval length with coverage probability (%) on top of each box for the unadjusted, OLS 
adjusted (only compnted when p = 50), cv(Lasso) adjusted and cv(Lasso+OLS) adjusted estimators with ua = 150. 


Boxplot with Standard Deviation on top (nA=100) 
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Figure 8 : Boxplot of the unadjusted, OLS adjusted (only computed when p = 50), cv(Lasso) and cv(Lasso+OLS) 
adjusted estimators with their standard deviations presented on top of each box for ua = 100. 
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Boxplot with Standard Deviation on top (nA=125) 



Figure 9: Boxplot of the unadjusted, OLS adjusted (only computed when p = 50), cv(Lasso) and cv(Lasso+OLS) 
adjusted estimators with their standard deviations presented on top of each box for = 125. 


Boxpiot with Standard Deviation on top (nA=150) 
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Figure 10: Boxplot of the unadjusted, OLS adjusted (only computed when p = 50), cv(Lasso) and cv(Lasso+OLS) 
adjusted estimators with their standard deviations presented on top of each box for ua = 150. 
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Unadjusted 
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cv(Lasso) 




1 5 9 


index 


cv(Lasso+OLS) 



index 


Figure 11: Boxplot of Neyman SD estimate with the “true” SD presented as red dot. 

Treated Control 




Figure 12: Adjustment (fitted) value comparison for cv(Lasso) and cv(Lasso+OLS). 
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Fourth moment 



Figure 13: Fourth moment of each covariate. The covariates with the largest two fourth moments (37.3 and 34.9 
respectively) are quadratic term interactnew^ and interaction term IMscorerct : systemnew respectively. Neither of 
them are selected by the Lasso to do the adjustment. All the fourth moments of the main effects are less than 7. 
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Figure 14: Histograms of ATE estimates. The green vertical lines are the true ATE; the red curves are the density 
of normal distribution; the blue curves are the kernel density estimate. The blue curves are very close to the red ones 
meaning that all the ATE estimates follow normal distribution. 
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Table 4: Bias, standard deviation (SD) and root-mean square error VMSE of ATE estimates 

_ iP,P) 


Statistic 

Method 

(50,0) 

(50,0.6) 

(500,0) 

(500,0.6) 

ua = 100 

bias 

Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.003(0.004) 

0.014(0.005) 

0.007(0.004) 

0.011(0.004) 

0.005(0.005) 

0.013(0.006) 

0.014(0.005) 

0.013(0.005) 

0.002(0.003) 

0.006(0.004) 

0.009(0.004) 

0.003(0.005) 

0.005(0.004) 

0.003(0.004) 

SD 

Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.79(0.08) 

0.72(0.07) 

0.62(0.06) 

0.63(0.06) 

1.17(0.11) 

0.96(0.09) 

0.82(0.08) 

0.82(0.08) 

0.79(0.07) 

0.67(0.06) 

0.65(0.06) 

1.17(0.11) 

0.84(0.08) 

0.84(0.08) 


Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.79(0.08) 

0.72(0.07) 

0.63(0.06) 

0.63(0.06) 

1.17(0.11) 

0.97(0.09) 

0.82(0.08) 

0.82(0.08) 

0.79(0.07) 

0.67(0.06) 

0.65(0.06) 

1.17(0.11) 

0.85(0.08) 

0.84(0.08) 

UA = 125 

bias 

Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.008(0.005) 

0.008(0.004) 

0.005(0.003) 

0.012(0.004) 

0.011(0.007) 

0.005(0.005) 

0.012(0.005) 

0.012(0.005) 

0.006(0.004) 

0.007(0.004) 

0.011(0.004) 

0.01(0.007) 

0.004(0.004) 

0.003(0.003) 

SD 

Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.80(0.08) 

0.69(0.06) 

0.62(0.06) 

0.62(0.06) 

1.15(0.11) 

0.90(0.09) 

0.79(0.07) 

0.79(0.07) 

0.8(0.08) 

0.67(0.06) 

0.65(0.06) 

1.15(0.11) 

0.82(0.08) 

0.81(0.08) 


Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.80(0.07) 

0.69(0.07) 

0.62(0.06) 

0.62(0.06) 

1.15(0.11) 

0.90(0.09) 

0.80(0.08) 

0.79(0.07) 

0.8(0.07) 

0.67(0.06) 

0.65(0.06) 

1.15(0.11) 

0.82(0.08) 

0.81(0.08) 

UA = 150 

bias 

Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.004(0.004) 

0.002(0.003) 

0.003(0.003) 

0.011(0.004) 

0.000(0.005) 

0.006(0.005) 

0.002(0.004) 

0.006(0.004) 

0.002(0.003) 

0.01(0.005) 

0.017(0.005) 

0.005(0.005) 

0.002(0.003) 

0.001(0.003) 

SD 

Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.85(0.08) 

0.76(0.07) 

0.66(0.06) 

0.67(0.06) 

1.19(0.11) 

0.96(0.09) 

0.82(0.08) 

0.81(0.07) 

0.85(0.08) 

0.72(0.07) 

0.71(0.07) 

1.19(0.11) 

0.84(0.08) 

0.84(0.08) 


Unadjusted 

OLS 

cv(Lasso) 

cv(Lasso-l-OLS) 

0.85(0.08) 

0.76(0.07) 

0.66(0.06) 

0.67(0.06) 

1.19(0.11) 

0.96(0.09) 

0.82(0.08) 

0.82(0.08) 

0.85(0.08) 

0.72(0.07) 

0.71(0.07) 

1.19(0.11) 

0.84(0.08) 

0.84(0.08) 


The numbers in parentheses are the corresponding standard errors estimated by using the bootstrap with B = 500 
resamplings of the ATE estimates. 
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Table 5: Mean number of selected covariates for treated and control group 


Group 

Method 



{P,P) 


(50,0) 

(50,0.6) 

(500,0) 

(500,0.6) 

UA = 100 

treated 

cv(Lasso) 

16 

13 

22 

22 


cv(Lasso+OLS) 

6 

6 

7 

7 

control 

cv(Lasso) 

20 

11 

32 

28 


cv(Lasso+OLS) 

8 

6 

7 

7 

nA = 125 

treated 

cv(Lasso) 

17 

13 

25 

24 


cv(Lasso+OLS) 

7 

6 

6 

6 

control 

cv(Lasso) 

19 

11 

32 

27 


cv(Lasso+OLS) 

8 

6 

9 

8 

nA = 150 

treated 

cv(Lasso) 

18 

13 

29 

26 


cv(Lasso+OLS) 

8 

7 

6 

6 

control 

cv(Lasso) 

19 

12 

30 

25 


cv(Lasso+OLS) 

8 

6 

11 

8 


Table 6: Coverage 

probability (%) and mean 

interval length (in parentheses) for 95% confidence interval 




iP,p) 



Methods 

(50,0) 

(50,0.6) 

(500,0) 

(500,0.6) 

UA = 100 

Unadjusted 

97.3(3.54) 

95.8(4.79) 

97.3(3.54) 

95.8(4.79) 

OLS 

92.2(2.55) 

90.0(3.19) 

- 

- 

cv(Lasso) 

95.8(2.58) 

94.5(3.20) 

94.3(2.61) 

92.4(3.07) 

cv(Lasso+OLS) 

95.6(2.57) 

94.4(3.17) 

94.8(2.60) 

93.0(3.11) 

UA = 125 

Unadjusted 

97.4(3.56) 

96.0(4.74) 

97.3(3.56) 

95.9(4.74) 

OLS 

93.3(2.54) 

91.6(3.14) 

- 

- 

cv(Lasso) 

96.0(2.56) 

95.0(3.15) 

94.1(2.59) 

92.9(3.02) 

cv(Lasso+OLS) 

95.7(2.55) 

94.9(3.12) 

94.4(2.58) 

93.6(3.06) 

UA = 150 

Unadjusted 

97.1(3.72) 

95.8(4.88) 

97.1(3.72) 

95.8(4.88) 

OLS 

91.4(2.64) 

90.4(3.21) 

- 

- 

cv(Lasso) 

95.4(2.66) 

94.9(3.23) 

92.9(2.68) 

92.6(3.08) 

cv(Lasso+OLS) 

94.7(2.63) 

94.8(3.19) 

92.0(2.63) 

93.1(3.11) 


The numbers in parentheses are the corresponding mean interval lengths. 
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Algorithm 1 A-fold Cross Validation (CV) for the Lasso+OLS estimator 


Input: Design matrix A, response Y and a sequence of tuning parameter Ai,Aj; Number of folds K. 

Output: The optimal tuning parameter selected by CV: XopUmai- 
1: Divide randomly the data 2 = {X, Y) into K roughly equal parts Zk^k = 1,A; 

2: For each k = denote = 0 and /3Ltsso+OLs("^o) = 0. 

• Fit the model with parameters Xj,j = 1,J to the other A — 1 parts Z-k = z \ Zk oi the data, giving the 
Lasso solution path j3^^\Xj),j = 1,..., J and compute the selected covariates set S^^\Xj) = {I : j3\^\Xj) ^ 
0}, j = 1,J on the path; 

• For each j = 1,J, compute the Lasso+OLS estimator: 


arg mm 


/3LtLo+OLs(^i) = { fi- ft=0. Vj^S('“)(A+ [ 2|Z-fc| 

I ^Ssso+OLS1). Otherwise; 


if ^«(A,)^5W(A,_i), 


Compute the error in predicting the /cth part of the data : 






• (k) (\\Y. 

Lasso+OLS vAD > 


3 : Compute cross validation error CV{Xj), j = 1,J: 


4; Compute the optimal A selected by CV; 


CV{Xi) = -Y^PE^^\X,)-, 


Xopumai = argmin CV{Xj); 


5 ; return A, 


optimal • 





